Locking in Week 5 — one defensible page that says where a system can be hit and what holds
Day 25 of 60
Five days ago, robustness was probably a word that meant "the model is good." Now it's a discipline. You can explain why jailbreaks work (competing objectives, mismatched generalization), why prompt injection is an architectural problem and not a prompting one, why agents turn content vulnerabilities into action vulnerabilities, and, with the Go result, why capability does not imply robustness. And you've built the artifact that ties it together: a coverage matrix that shows which layer stops which attack and where the gaps are.
A capable model is a soft layer that will sometimes be fooled. Robustness is what you build around it: layered, measured, monitored defenses, with the dangerous capabilities gated so that when a trick works, it still can't do anything catastrophic. You don't hope the model isn't jailbreakable — you assume it is, and engineer for that.
This week's capstone is a single page you could hand to a team before shipping a tool-using agent: an attack-surface map. It names each way in, the defenses on that path, and the honest residual gap. Building it forces every concept from the week into one place.
Start by listing every place where untrusted input reaches the model: the user's own messages (direct jailbreak), retrieved content (indirect injection via web, email, files, or tool output), and non-text channels (multimodal evasion). Give each entry point one row.
For each entry point, record which of your layers apply: input filter, system prompt, safety tuning, output filter, provenance, least privilege, human-in-the-loop, action monitoring. Pull this straight from your Day 23 matrix.
Name the path with the thinnest coverage, and be honest about where you have real guarantees versus best-effort. Some properties can be formally verified or hard-gated (an agent that was never granted a permission cannot use it); most model-level defenses are statistical and bypassable. Mark which is which. That honesty is the artifact's value.
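The map described above can be kept as plain data, which makes the thinnest path something you compute instead of eyeball. What follows is a minimal sketch, not a prescribed format: the class names, the helper thinnest_path, the assignment of layers to entry points, and the hard-versus-statistical label on each layer are all illustrative assumptions that you would replace with the contents of your own matrix.

```python
from dataclasses import dataclass, field
from enum import Enum


class Guarantee(Enum):
    HARD = "hard"                # enforced outside the model (assumed classification)
    STATISTICAL = "statistical"  # best-effort and bypassable


@dataclass(frozen=True)
class Defense:
    name: str
    guarantee: Guarantee


@dataclass
class EntryPoint:
    name: str
    defenses: list[Defense] = field(default_factory=list)
    residual_gap: str = ""

    def hard_count(self) -> int:
        """Number of layers on this path that are hard guarantees."""
        return sum(d.guarantee is Guarantee.HARD for d in self.defenses)


def stat(name: str) -> Defense:
    return Defense(name, Guarantee.STATISTICAL)


def hard(name: str) -> Defense:
    return Defense(name, Guarantee.HARD)


# Hypothetical rows: which layers sit on which path is an assumption.
ATTACK_SURFACE = [
    EntryPoint(
        "user messages (direct jailbreak)",
        [stat("input filter"), stat("system prompt"), stat("safety tuning"),
         stat("output filter"), hard("least privilege")],
        "novel phrasing can pass every statistical layer",
    ),
    EntryPoint(
        "retrieved content (indirect injection)",
        [stat("provenance"), hard("least privilege"),
         hard("human-in-the-loop"), stat("action monitoring")],
        "injected text can still steer low-risk actions",
    ),
    EntryPoint(
        "non-text channels (multimodal evasion)",
        [stat("safety tuning"), stat("output filter")],
        "no hard gate on this path in this example",
    ),
]


def thinnest_path(rows: list[EntryPoint]) -> EntryPoint:
    """Pick the row with the fewest hard guarantees, then the fewest layers."""
    return min(rows, key=lambda r: (r.hard_count(), len(r.defenses)))


if __name__ == "__main__":
    for row in ATTACK_SURFACE:
        layers = ", ".join(f"{d.name} [{d.guarantee.value}]" for d in row.defenses)
        print(f"{row.name}\n  defenses: {layers}\n  gap: {row.residual_gap}")
    print("thinnest:", thinnest_path(ATTACK_SURFACE).name)
```

Ranking by hard guarantees first is a design choice that mirrors the point above: three statistical layers are still best-effort, so a path with none of the hard kind is treated as thinner than a path with fewer layers but one real gate.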
Formal verification and hard permission boundaries give real guarantees about what the system is allowed to do — and those guarantees hold even against attacks no one imagined. They cannot, today, guarantee what a language model will say on every input. So the durable defenses are the architectural ones: gate the capabilities, and the model's softness stops being catastrophic. That's the single most important sentence to carry out of Part A's safety-engineering arc.
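One way to picture "gate the capabilities" is a dispatcher that sits between the model and its tools and enforces permissions in ordinary code, so the check does not depend on what the model says. The sketch below is an assumption-laden illustration: ToolGate, PermissionDenied, and the two toy tools are hypothetical names, and the approval callback stands in for whatever human-in-the-loop step a real system would use.

```python
from typing import Callable


class PermissionDenied(Exception):
    """Raised when a tool call is blocked by the gate, not by the model."""


# Hypothetical tools, stand-ins for real integrations.
def search_docs(query: str) -> str:
    return f"results for {query!r}"


def send_email(to: str, body: str) -> str:
    return f"sent to {to}"


class ToolGate:
    def __init__(
        self,
        granted: dict[str, Callable[..., str]],
        needs_approval: set[str],
        approve: Callable[[str, dict], bool],
    ) -> None:
        self._granted = dict(granted)            # least privilege: only these exist
        self._needs_approval = set(needs_approval)
        self._approve = approve                  # human-in-the-loop hook (assumed)

    def call(self, tool: str, **kwargs) -> str:
        # Hard boundary: a tool that was never granted cannot be reached,
        # whatever text the model produced to request it.
        if tool not in self._granted:
            raise PermissionDenied(f"{tool} is not granted to this agent")
        # Dangerous tools additionally require an explicit approval.
        if tool in self._needs_approval and not self._approve(tool, kwargs):
            raise PermissionDenied(f"{tool} was not approved")
        return self._granted[tool](**kwargs)


# Example wiring: the approver rejects everything, simulating an absent reviewer.
gate = ToolGate(
    granted={"search_docs": search_docs, "send_email": send_email},
    needs_approval={"send_email"},
    approve=lambda tool, args: False,
)

if __name__ == "__main__":
    print(gate.call("search_docs", query="refund policy"))
    for tool, args in [("send_email", {"to": "a@example.com", "body": "hi"}),
                       ("delete_files", {"path": "/tmp/x"})]:
        try:
            gate.call(tool, **args)
        except PermissionDenied as err:
            print("blocked:", err)
```

Even if a jailbreak or injected document convinces the model to request send_email or delete_files, the outcome is decided by the gate. The model's output is treated as a request, never as authority.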
A practitioner lists the attacks they know. An expert hands you a map: every entry point, the layered defenses on it, the honest residual gap, and a clear line between what's guaranteed and what's best-effort. The altitude jump is from cataloguing threats to governing them — owning the one page that decides whether a system is ready to ship, and being able to defend every cell of it in a review.
Say this in an interview: "Before shipping an agent I build an attack-surface map: entry points, defenses per path, residual gaps, and a clear line between hard guarantees and statistical best-effort. I assume the model can be jailbroken, so I gate the dangerous capabilities — that way a successful trick still can't cause a catastrophe. Robustness is the stack, never the model alone."